10. This question should be answered using the Weekly data set, which is part of the ISLR package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.
(a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?
I have RStudio installed, and its packages are stored in the following location: "/Library/Frameworks/R.framework/Versions/3.4/Resources/library". This is not where Anaconda stores its R packages ("/Users/Ajay/anaconda/lib/R/library"), so we add RStudio's library path for use in our Jupyter notebook.
In [1]:
.libPaths("/Library/Frameworks/R.framework/Versions/3.4/Resources/library")
.libPaths() #Added R Studio's Package Path to Anaconda's Path
library(ISLR)
library(MASS)
Let us take a look at some values of the Weekly Dataset.
In [2]:
Weekly = na.omit(Weekly)
head(Weekly) #Use fix(Weekly) in R Studio
There are 1089 samples with 8 covariates and 1 response variable (Direction)
In [6]:
dim(Weekly)
Let us generate a summary for every variable
In [94]:
summary(Weekly)
From the statistics above, we can conclude that all 8 covariates are quantitative. Year could arguably be treated as qualitative, since it only takes 21 distinct values.
In [7]:
unique_years = unique(Weekly$Year)
length(unique_years)
Now let us take a look at the relationships between these variables with a scatterplot matrix. We will remove the categorical response variable Direction, as its scatterplots are not meaningful here.
In [96]:
pairs(Weekly[,-c(length(Weekly))])
From the matrix above, there does not appear to be any relationship among the covariates except between Year and Volume. Let us numerically identify collinear relationships with a correlation matrix.
In [97]:
cor(Weekly[,-c(length(Weekly))])
Our scatterplot matrix and correlation matrix agree: the only substantial value is the correlation between Year and Volume.
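To see this relationship directly, here is a quick sketch (not one of the original cells) plotting Volume against Year; the steady growth in trading volume over time is what drives that correlation.
In [ ]:
#Sketch: visualize the only strong relationship in the data, Volume growing over time
plot(Weekly$Year, Weekly$Volume, xlab="Year", ylab="Volume")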
10. (b) Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?
The question asks us to use the 5 lag variables and Volume as the predictors. But let us also include Today among our covariates and train a model with logistic regression.
In [105]:
cols = names(Weekly[-c(1, length(Weekly))]) #All columns except Year and Direction, so Today is included
covariates = paste(cols, collapse="+")
response = names(Weekly[length(Weekly)])
glm(paste(response,"~",covariates),
data=Weekly,
family=binomial)
The warning above is an instance of perfect separation (closely related to the Hauck-Donner effect). It occurs when the minimum value of a covariate for one response class is greater than the maximum value of that covariate for the other response class. Let us find the values of Today.
In [149]:
todays_dep = Weekly[,8:9]
down_samples = todays_dep[todays_dep$Direction == "Down",]
up_samples = todays_dep[todays_dep$Direction == "Up",]
print(paste("Number of Down samples = ", dim(down_samples)[1]))
print(paste("Number of Up samples = ", dim(up_samples)[1]))
#dim(up_samples)
print(paste("Maximum value of Today for Down Samples = ",max(down_samples$Today)))
print(paste("Minimum value of Today for Up Samples = ",min(up_samples$Today)))
Clearly, the minimum value of Today for the Up samples is greater than the maximum value of Today for the Down samples. The maximum-likelihood fit therefore pushes the coefficient for Today toward infinity, so we remove this column for now and retrain our model.
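As a quick visual check before retraining (a sketch added here, not one of the original cells), a boxplot of Today grouped by Direction makes the separation obvious: the two classes do not overlap at all.
In [ ]:
#Sketch: Today is perfectly separated by Direction, which is what breaks the logistic fit
boxplot(Today ~ Direction, data=Weekly, ylab="Today")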
In [151]:
cols = names(Weekly[2:7]) #Lag1-Lag5 and Volume only (excluding Year, Today, and Direction)
covariates = paste(cols, collapse="+")
response = names(Weekly[length(Weekly)])
model = glm(paste(response,"~",covariates),
data=Weekly,
family=binomial)
summary(model)
The null hypothesis $H_0$ states that the coefficient associated with a predictor $X_i$ in the model for the response variable $y$ is 0.
$$ H_0: \textit{No Relationship between Predictor } X_i \textit{ and response variable } y $$
We reject the null hypothesis for $p<0.05$. Of the coefficients above, only Lag2 is statistically significant, as evidenced by its small p-value of 2.96%.
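We can confirm this programmatically by pulling the coefficient table out of the summary and filtering on the p-value column (a small sketch; coefs is just a hypothetical name used here).
In [ ]:
#Sketch: keep only the coefficients with p-values below 0.05
coefs = summary(model)$coefficients
coefs[coefs[,"Pr(>|z|)"] < 0.05, , drop=FALSE]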
10 (c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.
In [159]:
probs = predict(model, type="response")
predictions = rep("Down", length(probs))
predictions[probs > 0.5] = "Up"
In [161]:
table(predictions, Weekly$Direction)
In [162]:
(54+557)/(54+557+48+430)
This classifier has a training accuracy of about 56.1%, i.e., a training error rate of $100-56.1 = \textit{43.9%} $. Bear in mind that training error usually understates the test error, so it paints an optimistic picture of the classifier. The confusion matrix above displays predictions along the rows and the actual values along the columns.
Treating the Down category as the positive class, we can compute the following performance measures:
$$ Precision = \frac{TP}{TP + FP} = \frac{54}{54 + 48} = 0.5294 $$
$$ Recall = \frac{TP}{TP + FN} = \frac{54}{54 + 430} = 0.1115 $$
$$ F-Measure = \frac{2PR}{P+R} = \frac{2 \times 0.5294 \times 0.1115}{0.5294 + 0.1115} = 0.1842 $$
From these metrics we see that logistic regression rarely identifies the weeks on which the market actually goes down, as evidenced by the very low recall.
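The same numbers can be computed directly from the confusion matrix. This is a sketch; the names cm, TP, FP, FN, and TN are introduced here just for illustration.
In [ ]:
#Sketch: accuracy, precision, recall, and F-Measure with "Down" as the positive class
cm = table(predictions, Weekly$Direction)
TP = cm["Down","Down"]; FP = cm["Down","Up"]
FN = cm["Up","Down"]; TN = cm["Up","Up"]
precision = TP/(TP + FP)
recall = TP/(TP + FN)
f_measure = 2*precision*recall/(precision + recall)
round(c(accuracy=(TP + TN)/sum(cm), precision=precision, recall=recall, f_measure=f_measure), 4)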
10 (d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).
In the previous subsections, we trained and tested on the same data. Now we will use the validation set approach by holding back part of the data for testing. First, we will get rid of all predictors except Lag2.
In [3]:
data = Weekly[c(1,3,length(names(Weekly)))] #Get Rid of all Predictors Except Lag2 and Year
head(data)
We now split the data into train and test (or validation) sets and drop the Year covariate (the 1st column).
In [4]:
train = data[data['Year'] != 2009,-1]
test = data[data['Year'] == 2009,-1]
dim(train)
dim(test)
When reduced to a single column, the test set gets converted from a data.frame object to its constituent type, numeric. I need to preserve the data.frame structure to make predictions using the predict method.
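An alternative (sketched here with the hypothetical name X_test_alt) is to subset with drop=FALSE, which keeps the single column as a data.frame without the explicit data.frame() wrapper used below.
In [ ]:
#Sketch: drop=FALSE preserves the data.frame structure when selecting one column
X_test_alt = test[, "Lag2", drop=FALSE]
class(X_test_alt)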
In [234]:
X_test = data.frame(Lag2=test[,-2]) #Convert vector to a data.frame
In [233]:
model = glm(Direction~Lag2 ,data=train, family=binomial)
probs = predict(model, X_test, type="response")
predictions = rep("Down", nrow(X_test))
predictions[probs > 0.5] = "Up"
table(predictions, test[,-1])
We can compute testing accuracy:
$$ \textit{Simple Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{4+25}{4+25+4+19} = 0.5576 = \textbf{55.76%} $$
Let us also take a look at the precision, recall, and F-Measure.
$$ Precision = \frac{TP}{TP + FP} = \frac{4}{4 + 4} = 0.5 $$
$$ Recall = \frac{TP}{TP + FN} = \frac{4}{4 + 19} = 0.1739 $$
$$ F-Measure = \frac{2PR}{P+R} = \frac{2 \times 0.5 \times 0.1739}{0.5 + 0.1739} = 0.258 $$
This held-out accuracy is about as good as the training accuracy we obtained when all predictors were included, which fits with Lag2 being the only statistically significant predictor in that model.
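The overall fraction of correct predictions can also be computed directly from the predictions above (a one-line sketch).
In [ ]:
#Sketch: fraction of correct predictions on the held-out data
mean(predictions == test$Direction)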
10 (e) Repeat (d) using LDA
In [241]:
model = lda(Direction~Lag2 ,data=train)
predictions = predict(model, X_test)
table(predictions$class, test[,-1])
Logistic Regression and LDA have very similar performance when predicting Direction from Lag2.
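As a quick check (a sketch, not one of the original cells), the held-out accuracy of the LDA fit can be computed the same way as before.
In [ ]:
#Sketch: held-out accuracy for the LDA fit
mean(predictions$class == test$Direction)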
10 (f) Repeat (d) using QDA
In [242]:
model = qda(Direction~Lag2 ,data=train)
predictions = predict(model, X_test)
table(predictions$class, test[,-1])
QDA predicts an upward trend for every single sample.
$$ \textit{Simple Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{0+29}{0+29+0+23} = 0.5576 = \textbf{55.76%} $$
We still get the same (still unimpressive) accuracy.
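A quick sketch confirms both points: every test observation is predicted Up, and the accuracy matches the value computed above.
In [ ]:
#Sketch: QDA predicts "Up" for every test sample; accuracy is unchanged
print(table(predictions$class))
mean(predictions$class == test$Direction)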
10 (g) Repeat (d) with KNN taking K=1.
We need to provide 4 arguments to knn(): the training predictors, the test predictors, the training labels, and the number of neighbors k.
In [13]:
library(class)
X_train = data.frame(Lag2=train[,1])
X_test = data.frame(Lag2=test[,1])
y_train = train[,2]
y_test = test[,2]
predictions = knn(X_train, X_test, y_train, k=1) #K = 1, as the question asks
table(predictions, y_test)
Note that knn() expects the response labels as a vector, not a data.frame. Considering Down as the positive class, let us take a look at the simple accuracy:
$$ \textit{Simple Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{13+14}{13+14+15+10} = 0.5192 = \textbf{51.92%} $$
Let us also take a look at the precision, recall, and F-Measure.
$$ Precision = \frac{TP}{TP + FP} = \frac{13}{13 + 15} = 0.4643 $$
$$ Recall = \frac{TP}{TP + FN} = \frac{13}{13 + 10} = 0.5652 $$
$$ F-Measure = \frac{2PR}{P+R} = \frac{2 \times 0.4643 \times 0.5652}{0.4643 + 0.5652} = 0.5098 $$
10 (h) Which of these methods appears to provide the best results on this data?
Logistic Regression, LDA, and QDA give the same testing accuracy of 55.76%. However, the nature of the QDA results is different from the other two: QDA predicted a positive return for every week in the test set, regardless of Lag2, the percentage return from two weeks before.
$$ Recall_{QDA} = \frac{TP}{TP + FN} = \frac{0}{0+23} = \textbf{0} $$
Although the simple accuracy is the same, QDA has zero recall for the Down class (and its precision is undefined, since it never predicts Down), which arguably makes it worse than the other two. LDA and Logistic Regression have the same confusion matrix, hence the same performance measures. KNN with K=1 delivers lower simple accuracy but a higher F-Measure.
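To put the four methods side by side, here is a sketch that refits each of them on the same train/test split from part (d) and reports the held-out accuracy. It assumes train, test, X_train, X_test, and y_train are still defined as above; acc and the *.pred names are hypothetical.
In [ ]:
#Sketch: held-out accuracy of all four Lag2-only classifiers on the same split
acc = function(pred) mean(pred == test$Direction)
glm.fit = glm(Direction~Lag2, data=train, family=binomial)
glm.pred = ifelse(predict(glm.fit, X_test, type="response") > 0.5, "Up", "Down")
lda.pred = predict(lda(Direction~Lag2, data=train), X_test)$class
qda.pred = predict(qda(Direction~Lag2, data=train), X_test)$class
set.seed(1) #knn() breaks distance ties at random
knn.pred = knn(X_train, X_test, y_train, k=1)
round(c(Logistic=acc(glm.pred), LDA=acc(lda.pred), QDA=acc(qda.pred), KNN1=acc(knn.pred)), 4)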
10 (i) Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.
In [4]:
names(Weekly)
Let us try Logistic Regression. We split the data into test and train sets such that all observations recorded after 2007 are a part of the test set. Anything recorded earlier is in our train set.
In [12]:
train = Weekly[Weekly$Year < 2008, ]
test = Weekly[Weekly$Year >= 2008, ]
dim(train)
dim(test)
Let us eliminate Today and the response variable Direction from the model fitting process.
In [47]:
glm.fit = glm(Direction~.-Today-Direction, data=train, family="binomial")
glm.predictions = predict(glm.fit, test[,-c(length(test))], type="response")
preds = rep("Down", nrow(test))
preds[glm.predictions >= 0.5] = "Up"
table(preds, test[,length(test)])[2:1, 2:1]
An interaction term is the product of two covariates. Let us start with Lag1 and Lag2, because the percentage returns from one and two weeks prior may exhibit synergy.
In [80]:
glm.fit = glm(Direction~.+Lag1*Lag2-Today-Direction, data=train, family="binomial")
glm.predictions = predict(glm.fit, test[,-c(length(test))], type="response")
preds = rep("Down", nrow(test))
preds[glm.predictions >= 0.5] = "Up"
table(preds, test[,length(test)])[2:1, 2:1]
Not quite. After trying several combinations of Lag variables, the best interaction was between Lag3 and Lag5.
In [81]:
glm.fit = glm(Direction~.+Lag3*Lag5-Today-Direction, data=train, family="binomial")
glm.predictions = predict(glm.fit, test[,-c(length(test))], type="response")
preds = rep("Down", nrow(test))
preds[glm.predictions >= 0.5] = "Up"
table(preds, test[,length(test)])[2:1, 2:1]
In [82]:
summary(glm.fit)
Observation: Why did the performance increase when none of these coefficients is significant? One plausible explanation is that the held-out set is fairly small, so a slight shift in the fitted decision boundary can flip a handful of predictions even without any individually significant coefficient.
Now we will consider polynomial terms, starting with the square of the best-performing term, Lag5.
In [14]:
glm.fit = glm(Direction~.+Lag5+I(Lag5^2)-Today-Direction, data=train, family="binomial")
glm.predictions = predict(glm.fit, test[,-c(length(test))], type="response")
preds = rep("Down", nrow(test))
preds[glm.predictions >= 0.5] = "Up"
table(preds, test[,length(test)])[2:1, 2:1]
It looks like there is an uptick in performance when $Lag5^2$ is added to the usual covariates.
In [15]:
summary(glm.fit)
Let us see model performance with Linear Discriminant Analysis.
In [23]:
model = lda(Direction~.,data=train)
predictions = predict(model, test[,-c(length(test))])
table(predictions$class, test[,length(test)])[2:1,2:1]
We got rid of most of the false positives seen in Logistic Regression. The addition of polynomial terms doesn't seem to help much.
In [38]:
#Helper: print a label, then return the confusion matrix (rows = predictions, columns = actual), reordered so that "Up" comes first
print_table = function (model, param){
    print(param)
    predictions = predict(model, test[,-c(length(test))])
    table(predictions$class, test[,length(test)])[2:1,2:1]
}
#Adding Lag1^2
model = lda(Direction~.+I(Lag1^2),data=train)
print_table(model, "Adding Lag1^2")
model = lda(Direction~.+I(Lag2^2),data=train)
print_table(model, "Adding Lag2^2")
model = lda(Direction~.+I(Lag3^2),data=train)
print_table(model, "Adding Lag3^2")
model = lda(Direction~.+I(Lag4^2),data=train)
print_table(model, "Adding Lag4^2")
model = lda(Direction~.+I(Lag5^2),data=train)
print_table(model, "Adding Lag5^2")
model = lda(Direction~.+I(Volume^2),data=train)
print_table(model, "Adding Volume^2")
Let us see how QDA classifies the direction.
In [40]:
model = qda(Direction~.,data=train)
print_table(model, "Normal Model with All variables")
We have a significant number of false positives. Furthermore, adding polynomial terms only hampers performance.
In [41]:
model = qda(Direction~.+I(Lag1^2),data=train)
print_table(model, "Adding Lag1^2")
model = qda(Direction~.+I(Lag2^2),data=train)
print_table(model, "Adding Lag2^2")
model = qda(Direction~.+I(Lag3^2),data=train)
print_table(model, "Adding Lag3^2")
model = qda(Direction~.+I(Lag4^2),data=train)
print_table(model, "Adding Lag4^2")
model = qda(Direction~.+I(Lag5^2),data=train)
print_table(model, "Adding Lag5^2")
model = qda(Direction~.+I(Volume^2),data=train)
print_table(model, "Adding Volume^2")
Observation: Wow! Adding a $Volume^2$ Term dramatically increased the number of false negatives!
Let us see the performance of KNN on this dataset. Initially, I consider $k=5$.
In [22]:
library(class)
X_train = train[,-length(train)]
X_test = test[,-c(length(test))]
y_train = train[,length(train)]
y_test = test[,length(test)]
predictions = knn(X_train, X_test, y_train, k=5)
table(predictions, y_test)
Not bad. What happens if I include the square of, say, $Lag5$ in our training data? I am doing this manually by creating a data frame of the squared values.
In [132]:
feature = data.frame(Squared = train$Lag5^2) #One-column data frame of squared Lag5 values
X_train_new = cbind(X_train, feature)
head(X_train_new)
Let us add this new feature to our training data
In [133]:
#Let us add the same for testing data.
feature = data.frame(Squared = test$Lag5^2) #Same squared feature for the testing data
X_test_new = cbind(X_test, feature)
head(X_test_new)
In [136]:
y_train = train[,length(train)]
y_test = test[,length(test)]
predictions = knn(X_train_new, X_test_new, y_train, k=5)
table(predictions, y_test)
It looks like there was a decrease in performance when $Lag5$ squared was included.
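Finally, the exercise also asks us to experiment with different values of K. Here is a sketch that uses the same covariates as the earlier k=5 run (X_train, X_test, y_train, y_test; preds.k is a hypothetical name) and loops over a few choices of k; set.seed is used because knn() breaks distance ties at random.
In [ ]:
#Sketch: held-out accuracy of KNN for several values of k
set.seed(1)
for (k in c(1, 3, 5, 10, 20, 50)) {
    preds.k = knn(X_train, X_test, y_train, k=k)
    print(paste("k =", k, "-> accuracy =", round(mean(preds.k == y_test), 4)))
}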